
CU2e77a5x - Add a CDB merge function #373

Merged - 18 commits (Dec 18, 2023)

Conversation

@adam-sutton-1992 (Collaborator) commented Nov 22, 2023

This enhancement is at an early stage; I'll list what I've done, the questions I have, and a TODO list.

What I've done:

  • I've made it a static method on the CDB class so that the parameters are not modified in place - in case the user has further use for them.
  • Copy cdb1 and its config. I consider this CDB to be the base.
  • For all the name2x dicts, I check whether they exist in both CDBs and try to merge them in the most suitable way (e.g. name2count_train adds the numbers of training samples together).
  • If a name key only exists in cdb2, I add that key and value to the new CDB.
  • I merge snames using a set union.
  • I follow a similar process for the cui2x dicts as for names, merging them as appropriate.
  • For context vectors where the CUI exists in both CDBs, I perform a weighted sum, with the weights sourced from the cui2count_train dict. I check each window length accordingly.
  • vocab sums the occurrences of each word from both CDBs.
  • addl_info prioritises cdb1 entries over cdb2 when a key exists in both CDBs (see the sketch after this list).
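
To make these rules concrete, here is a minimal sketch of the per-dictionary merge logic (a sketch only - the helper names are illustrative, not the actual implementation, and context vectors are assumed to be numpy arrays):

import numpy as np

def merge_count_dicts(d1: dict, d2: dict) -> dict:
    # Counts are additive: a key present in both CDBs gets the sum of its counts.
    merged = dict(d1)
    for key, count in d2.items():
        merged[key] = merged.get(key, 0) + count
    return merged

def merge_context_vectors(vec1: np.ndarray, vec2: np.ndarray,
                          count1: int, count2: int) -> np.ndarray:
    # Weighted average, with the weights sourced from cui2count_train.
    return (count1 * vec1 + count2 * vec2) / (count1 + count2)

# snames is a plain set union: merged_snames = cdb1.snames | cdb2.snames
# addl_info keeps cdb1's value whenever a key exists in both CDBs.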

Questions:

  • Is there a better way of checking for the existence of keys in each dictionary than what I've done? I've gone for what I think is most readable, but with the fewest calls to save time - such as only looping through one CDB, not two.
  • Is the weighted average function correct? Are the context vectors always generated from the same embedding space?
  • Are there alternative methods wanted for merging context vectors?
  • What is cui2info used for? There is no documentation for it, and it seems redundant compared to addl_info.
  • How would we handle merging two addl_info with the same keys?
  • What are the ideal implementations for overwrite and vector_import params?
  • What dictionaries are safe to assume are "always populated"?
  • Will all dictionaries that are populated have the same entries?

TODO:

  • Implement the added parameters (overwrite and vector_import)
  • Write tests for all edge cases
  • Handle KeyErrors for when CUIs exist in cui2context_vectors but not in cui2count_train (see the sketch below)
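
A minimal sketch of how that last item might be handled, assuming a fallback to an unweighted mean when no counts are available (the function name and the fallback are illustrative assumptions):

def safe_merge_vector(cdb1, cdb2, cui, vec1, vec2):
    # .get with a default avoids the KeyError when a CUI has context
    # vectors but no entry in cui2count_train.
    count1 = cdb1.cui2count_train.get(cui, 0)
    count2 = cdb2.cui2count_train.get(cui, 0)
    if count1 + count2 == 0:
        return (vec1 + vec2) / 2  # no counts recorded: fall back to a plain mean
    return (count1 * vec1 + count2 * vec2) / (count1 + count2)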

@adam-sutton-1992 adam-sutton-1992 added the enhancement New feature or request label Nov 22, 2023
@adam-sutton-1992 adam-sutton-1992 self-assigned this Nov 22, 2023
@adam-sutton-1992 (Collaborator Author) commented Nov 27, 2023

Thanks for the comments. I've already made some of the suggested changes locally while testing. I'm also looking at using medcat.cdb.add_names instead of manually merging the names dictionaries, as that should be safer and more modular.

I'll upload them shortly.

Regarding where this method should go - I don't see an ideal home for it in medcat.utils currently. CDBMaker could be a good home for it, although that would probably require some cleaning up.

@mart-r (Collaborator) commented Nov 27, 2023

I'll try and answer some of the questions as well.

  • Is there a better way of checking for the existence of keys in each dictionary than what I've done? I've gone for what I think is most readable, but with the fewest calls to save time - such as only looping through one CDB, not two.
    • I can't think of a way to make this much better. In principle, if we wanted to blindly merge every dict, we could pass the pairs of dicts to a separate method that deals with them identically. But the merging isn't identical for all of these. Plus we don't always know which dicts do and don't get populated.
  • Is the weighted average function correct? Are the context vectors always generated from the same embedding space?
    • I think it makes sense. I suppose there could be a different embedding space if different vocabs are used? Though I don't know how common that is.
  • Are there alternative methods wanted for merging context vectors?
    • I don't think so, since the individual training data isn't retained, so you can't go back and re-train on everything (which in theory might be a better option)
  • What is cui2info used for? There is no documentation for it, and it seems redundant compared to addl_info.
    • I'm sure you saw that medcat.utils.helpers uses it for some kind of grouping and ICD10 mappings. But I don't actually know if that's used by anyone. Though it very well might be.
  • How would we handle merging two addl_info with the same keys?
    • That's going to be difficult indeed. There's really not a lot of oversight over what does or does not go there
    • With that said, if someone's got a very complicated setup, they should do their own custom merging
    • But we should probably, at the very least, warn if we encounter something we are unsure how to merge
  • What are the ideal implementations for overwrite and vector_import params?
    • I don't actually know what purpose they're supposed to have
    • Though I could see one wanting to use overwrite to change the first of the two CDBs in place instead of returning a new one
  • What dictionaries are safe to assume are "always populated"?
    • There isn't really much consistency here all in all
    • I ran a simple script over the 11 models I had to see which ones were populated. Take from it what you will
      • 'name2cuis2status': 11
      • 'vocab': 11
      • 'addl_info': 11
      • 'name2cuis': 11
      • 'cui2names': 9
      • 'cui2type_ids': 9
      • 'cui2context_vectors': 9
      • 'cui2count_train': 9
      • 'name2count_train': 8
      • 'name_isupper': 7
      • 'cui2preferred_name': 7
      • 'cui2snames': 6
      • 'cui2tags': 2
      • 'cui2average_confidence': 1
    • I ran the script again on the 19 models I've got, with more details
      • Bear in mind, some of these models probably haven't received any training
      • Which is why they'll have nothing in e.g. cui2count_train
    • Here are the results
      • vocab 1.00 (19 / 19)
      • name2cuis2status 1.00 (19 / 19)
      • addl_info 1.00 (19 / 19)
      • name2cuis 1.00 (19 / 19)
      • name_isupper 0.79 (15 / 19)
      • cui2type_ids 0.79 (15 / 19)
      • snames 0.79 (15 / 19)
      • cui2names 0.79 (15 / 19)
      • cui2context_vectors 0.68 (13 / 19)
      • cui2count_train 0.68 (13 / 19)
      • name2count_train 0.68 (13 / 19)
      • cui2snames 0.63 (12 / 19)
      • cui2preferred_name 0.37 (7 / 19)
      • cui2tags 0.11 (2 / 19)
      • _memory_optimised_parts 0.11 (2 / 19)
      • cui2average_confidence 0.05 (1 / 19)
  • Will all dictionaries that are populated have the same entries?
    • If they're populated, they should be populated for everything. But I don't think there's any guarantee

@adam-sutton-1992 (Collaborator Author) commented Nov 27, 2023

Regarding the context vectors, and how to test that merging them works appropriately:

It is difficult to test whether merging two CDBs results in appropriate context vectors calculated via a weighted average.

To test this we are going to:

  1. Create/source one or more dummy documents. These can be very small / inaccurate.
  2. Train cdb1 and cdb2 on a split of the documents. Train super_cdb on the entirety of the documents.
  3. Merge cdb1 and cdb2 to make merge_cdb, and check whether its context vectors match up with those of super_cdb.
  4. We would expect the context vectors of merge_cdb and super_cdb to be largely the same (i.e. similarity > 0.9xxx, allowing for floating-point differences).

We have considered the case of negative samples affecting the context vectors, as they are random. Given a large enough training corpus this wouldn't be an issue, but that is beyond the scope of this kind of testing, so we will set the number of negative samples to 0. In practice this would not be ideal, but here it is a good workaround for testing.

We would also add documentation noting that context vectors produced by merging will not be identical to those that would have been generated by training on everything at once.
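
A rough sketch of the comparison step, given merge_cdb and super_cdb from the steps above (the 0.9 threshold and the per-window layout of cui2context_vectors follow the plan above; this is not the final test code):

import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

for cui, vectors in merge_cdb.cui2context_vectors.items():
    for window, vec in vectors.items():
        sim = cosine_similarity(vec, super_cdb.cui2context_vectors[cui][window])
        assert sim > 0.9, f"{cui}/{window}: similarity {sim:.4f} is too low"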

@mart-r (Collaborator) commented Dec 12, 2023

Please make the GHA (flake8, mypy, tests) run successfully.
I would recommend running them locally before pushing. Even though you may not cover all Python versions (though you could), you'd probably still cover most things.
I just use the following in my pre-push hook:

flake8 medcat
python -m mypy --follow-imports=normal medcat

(tests take a long time so I don't like them in the hook - I run them separately).

And you could then also update the top description as to what the current state of the PR is - what's been done and what's still a WIP. Perhaps convert the questions and/or TODOs to a task list with checkboxes. Though it would be great to retain anything that fell out of scope for whatever reason.

@adam-sutton-1992 (Collaborator Author):

I've fixed the tests - I had issues with flake8 locally. Flake8 6.0 is compatible with our flake8 config file.

@mart-r (Collaborator) commented Dec 12, 2023

I've fixed the tests - I had issues with flake8 locally. Flake8 6.0 is compatible with our flake8 config file.

See requirements-dev.txt and CONTRIBUTING - we've set things up with v4.0.1.
Though there may well be reasons to upgrade. You can make a CU task if you think there are.

I'll take a look at the entire thing later today.

@adam-sutton-1992 (Collaborator Author):

Just an issue on my local machine. I'm writing up a list now of the changes I've made and whether there's anything else left to do.

@mart-r (Collaborator) commented Dec 12, 2023

Just an issue on my local machine.

Just saying it's easier if you keep your virtual environment in sync with the project requirements :)

@mart-r (Collaborator) left a review comment:

Overall I think this looks good.
But I think we can improve a few small things.

@@ -90,5 +90,44 @@ def test_cui2snames_population(self):
            with self.subTest(cui):
                self.assertIn(cui, self.undertest.cui2snames)

    def test_merge_cdb(self):
Collaborator:

This one test seems to test a lot of different things, and has a lot of setup in the actual test method.

What I would suggest is a new class (unittest.TestCase) where, in the class setup method, the two CDBs are generated and merged, and then having various small test methods where each of the specific things is tested for.

Though in principle, these different test cases (i.e. the different changes that you're testing for within the merge) might also benefit from being separated into different classes. That way we'd be doing (and testing for) one change at a time. But I think it's fine to do it all at once here. Tests already take a long time; I would hate to slow them down even further. We can revisit at a later date if/when this becomes an issue.

Collaborator Author:

I was personally unhappy with the test method being where it is, and with it not being very elegant.

I think if we move the method to its own utils module elsewhere in medcat, then splitting these out would be ideal.

Collaborator:

Unless you see a reason to use this outside of tests, I don't think it fits in the medcat package itself.
Though there is a tests.helper module you could move some boilerplate to.
On the other hand, if this piece of code is only relevant to testing CDB merging, creating a new module for testing the merge capabilities and having the helper method there would probably make more sense.

maker2 = CDBMaker(config) # second maker is required as it will otherwise point to same object
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "model_creator", "umls_sample.csv")
cdb1 = maker1.prepare_csvs(csv_paths=[path])
cdb2 = maker2.prepare_csvs(csv_paths=[path])
Collaborator:

Perhaps we can get away with 1 CDBMaker just making two identical CDBs? That seems to be the way this class works anyway (the CDBMaker is created once upon class setup and a separate CDB is made in the setup for each individual test).

Collaborator Author:

That is how I assumed it would work.

Making two CDBs from the same maker seemed to point to the same CDB, i.e.:

maker = CDBMaker(config)
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "model_creator", "umls_sample.csv")
cdb1 = maker.prepare_csvs(csv_paths=[path])
cdb2 = maker.prepare_csvs(csv_paths=[path])
print(cdb1 == cdb2)

prints True.

I'm happy to change this if I'm mistaken or have overlooked something in how the CDBs are built. The same config seems to be fine for CDBMaker.

Collaborator:

I think you're right - it just modifies the same instance each time the prepare_csvs method is called.
Though merging two identical CDBs to return a third identical one could also be an edge case to test (sketched below).

But what this means is that the way the entire class works is prone to side effects since in each setUp it looks like a new CDB is generated, whereas in reality the same instance is modified and returned.
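
That identical-merge edge case might look something like the following sketch (it assumes merge_cdb lives on CDB, as it does at this point in the PR; the class name and assertion are illustrative):

import os
import unittest

from medcat.cdb import CDB
from medcat.cdb_maker import CDBMaker
from medcat.config import Config

class IdenticalMergeTests(unittest.TestCase):

    def test_merge_identical_cdbs(self):
        maker = CDBMaker(Config())
        path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                            "model_creator", "umls_sample.csv")
        cdb1 = maker.prepare_csvs(csv_paths=[path])
        cdb2 = maker.prepare_csvs(csv_paths=[path])  # same instance as cdb1
        merged = CDB.merge_cdb(cdb1=cdb1, cdb2=cdb2)
        # Merging a CDB with itself should neither add nor drop concepts.
        self.assertEqual(set(merged.cui2names), set(cdb1.cui2names))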

@adam-sutton-1992 (Collaborator Author):

Changes made to medcat.cdb:

  1. Removed a lot of the manual in-place editing of the cui/name dictionaries; instead I've used the medcat.cdb method add_concept. This method has additional checks and can gracefully handle some of the dictionaries stored in medcat.cdb.addl_info.
  2. Context vectors now account for prioritising the training from either CDB when overwrite_training is provided, indicating which CDB takes priority. This is useful when merging CDBs where one comes from a more specialised training process (e.g. cancer CUIs should be better represented by a CDB that has had supervised training in that domain). cui2count_train is also handled according to the overwrite parameter.
  3. If this parameter isn't provided, a weighted average is used to generate the new context vectors (see the sketch after this list).
  4. The snames and vocab dicts are handled the same as before.
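
A minimal sketch of the vector-selection logic in points 2-3 (the function name is illustrative; the 1/2 encoding for overwrite_training follows the tests discussed later in this thread):

def pick_context_vector(vec1, vec2, count1, count2, overwrite_training=None):
    # overwrite_training marks one CDB's training as authoritative.
    if overwrite_training == 1:
        return vec1  # cdb1's trained vector wins outright
    if overwrite_training == 2:
        return vec2  # cdb2's trained vector wins outright
    # Default: weighted average using the per-CUI training counts
    # (vec1/vec2 are assumed to be numpy arrays).
    return (count1 * vec1 + count2 * vec2) / (count1 + count2)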

Added testing to tests.test_cdb.py:

  1. I generate two CDBs using umls_sample.csv.
  2. I then introduce some changes to each CDB: new CUIs, new context vectors, additional training counts.
  3. I merge the cdbs
  4. I test new cuis being in the merged cdb as intended, along with testing the weighted average functionality.

These tests might not be exhaustive; if there are any tests you'd like to see, let me know and I'll make sure to add them.

Regarding a cdb generated from a full dataset (full_cdb) vs a merged cdb (merge_cdb) from two cdbs split on the same dataset:

We agreed that this test isn't suitable for unit testing. It proved quite difficult to collect suitable results on very small training datasets. I tested with and without negative sampling and found similar results either way. Due to the small size of the training document, the vectors generated in the two CDBs were orthogonal to each other, so I added some overlap between the documents. This increased the similarity between the CDBs, and when comparing a fully trained CDB to a merged one, the performance was as expected.

For a fairer comparison, I tested the vector merging method from medcat.cdb.import_training. Where we use a weighted average, import_training takes a plain mean of the two context vectors. The similarities between merge_cdb and full_cdb were very close in both cases, with the weighted average being slightly higher.

Potential changes to be done:

  1. Where does this go? At the moment it's housed in medcat.cdb. We could also find a more suitable place for it in medcat.cdb_maker or medcat.utils. I think cdb_maker is a fair place, as it makes a cdb and could then use the config that is passed to the CDBMaker for the new, merged CDB.
  2. overwrite_training essentially changes the weighting to 0 and 1 for each CDB. It could also be done by passing a tuple that would allow any weighting. I'm unsure of any benefit to arbitrarily choosing the weighting when generating context vectors.
  3. More unit testing. There may be more edge cases that are worth testing.

@mart-r (Collaborator) commented Dec 13, 2023

  • Where does this go? At the moment it's housed in medcat.cdb. We could also find a more suitable place for it in medcat.cdb_maker or medcat.utils. I think cdb_maker is a fair place, as it makes a cdb and could then use the config that is passed to the CDBMaker for the new, merged CDB.

The only reason I'd be inclined to not put the method in cdb_maker is the fact that generally this module isn't used anywhere other than when generating a new CDB.
So if we truly wanted it not to be in this module, I'd be inclined to have it in its own module in the medcat.utils package.
Though in either case, if we move it out we lose the discoverability of having it attached to the CDB.
(We could have a method in the CDB that delegates to the actual method somewhere else, but we'd end up duplicating the docstring, which I'm not really a fan of.)

  1. overwrite_training essentially changes the weighting to 0 and 1 for each CDB. It could also be done by passing a tuple that would allow any weighting. I'm unsure of any benefit to arbitrarily choosing the weighting when generating context vectors.
  2. More unit testing. There may be more edge cases that are worth testing.

I think it might be worth trying to set up tests for the different overwrite_training values.
And technically, we'd probably want a separate test for full_build=True as well, but I don't know if you've got a good use case to demonstrate its effectiveness.

@tomolopolis (Member) left a review comment:

lgtm -

Couple of things though.

  • Can we move this to a medcat.utils.cdb module?
    • The rationale for moving it into this new module is that there is a CDB filter method somewhere that should also go there.
  • A tutorial on how to use this functionality would be nice. Perhaps in a specialised tutorial called something like "CDB management"? https://github.com/CogStack/MedCATtutorials
  • Responding to the thread re: tests - side effects in tests are a recipe for future frustration. Please fix with setup local to the class in the test, etc.

@mart-r (Collaborator) commented Dec 13, 2023

  • Responding to the thread re: tests - side effects in tests are a recipe for future frustration. Please fix with setup local to the class in the test, etc.

I agree. But I think it's outside the scope of this PR since it's the pre-existing CDBTests setup that's introducing that.
I'll create a CU task. And do a quick PR.

@adam-sutton-1992 (Collaborator Author):

A few more changes which I think finalise this:

  1. Set up different tests for overwrite_training=True and full_build.
  2. Moved the set up of CDBs into tests.helper as suggested
  3. Bug fixes due to added testing (thanks)
  4. Removed redundant checks for certain cases

Before I merge the request, I just want to check whether there are any further changes I've missed from the previous suggestions.

Regarding the move of this work to another module, should that be another CU also?

@mart-r (Collaborator) commented Dec 14, 2023

A few more changes which I think finalise this:

  1. Set up different tests for overwrite_training=True and full_build.
  2. Moved the set up of CDBs into tests.helper as suggested
  3. Bug fixes due to added testing (thanks)
  4. Removed redundant checks for certain cases

That all looks great!

Before I merge the request, I just want to check whether there are any further changes I've missed from the previous suggestions.

I think the tests could still benefit from being separated into multiple tests, especially since you added more tests to the same test method.
I.e. something along these lines:

class CDBMergeTests(unittest.TestCase):

    @classmethod
    def setUpClass(cls) -> None:
        to_merge = ForCDBMerging()
        cls.cdb1 = to_merge.cdb1
        cls.cdb2 = to_merge.cdb2
        cls.merged_cdb = CDB.merge_cdb(cdb1=cls.cdb1, cdb2=cls.cdb2)
        cls.merged_overwrite_cdb = CDB.merge_cdb(cdb1=cls.cdb1, cdb2=cls.cdb2,
                                                 overwrite_training=2, full_build=True)

    # tests for different things - properly named and (most likely) one per assert

By separating the tests into multiple properly named ones, we'll have more granularity about what breaks (if it does) in the future, instead of just "test_cdb_merge" failed, it'd say something like "test_cdb_merge_name_in_cui2names" failed.
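
For instance, one of those granular tests (the method body is an illustrative guess, not code from the PR) could look like:

    def test_cdb_merge_name_in_cui2names(self):
        # Every name from the second CDB's concepts should survive the merge.
        for cui, names in self.cdb2.cui2names.items():
            for name in names:
                self.assertIn(name, self.merged_cdb.cui2names[cui])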

Regarding the move of this work to another module, should that be another CU also?

If Tom said he'd like it in medcat.utils.cdb, then that's what it'll be :)
So I'd say do the move here. Though don't bother looking for the other thing he said should go there - that's definitely outside the scope of this PR.

@adam-sutton-1992 (Collaborator Author) commented Dec 14, 2023

I've moved everything over, and split out the tests.

I've named the medcat.utils.cdb module medcat.utils.cdb_utils, keeping form with similar modules. I've also created the testing.utils.test_cdb_utils module.

I'll merge when these changes get a thumbs up.

@mart-r (Collaborator) commented Dec 15, 2023

Once the two comments I've made are resolved, I'd be happy for a merge.

@mart-r (Collaborator) commented Dec 15, 2023

Looks all good to me now.

Wait for GHA to do its magic and we'll be good to merge on my end.

@adam-sutton-1992 adam-sutton-1992 merged commit 70305f4 into master Dec 18, 2023
5 checks passed
@mart-r (Collaborator) commented Jan 2, 2024

@adam-sutton-1992
For future reference, let's try to do a "Squash and merge" rather than a regular merge (there's an option under the arrow for a PR, and it'll remember your preference). That way we avoid a bunch of micro-commits on the master branch that are only really relevant in the context of a specific PR rather than the entire project.

@adam-sutton-1992 (Collaborator Author):

@mart-r

Hi, sorry - I will do so in the future.

Thanks.
